AITopics | automated essay scoring

Collaborating Authors

automated essay scoring

Information about AI from the News, Publications, and Conferences

Automatic Classification – Tagging and Summarization – Customizable Filtering and Analysis

If you are looking for an answer to the question What is Artificial Intelligence? and you only have a minute, then here's the definition the Association for the Advancement of Artificial Intelligence offers on its home page: "the scientific understanding of the mechanisms underlying thought and intelligent behavior and their embodiment in machines."

However, if you are fortunate enough to have more than a minute, then please get ready to embark upon an exciting journey exploring AI (but beware, it could last a lifetime) …

Operationalizing Automated Essay Scoring: A Human-Aware Approach

Plasencia-Calaña, Yenisel

arXiv.org Artificial IntelligenceOct-20-2025

This paper explores the human-centric operationalization of Automated Essay Scoring (AES) systems, addressing aspects beyond accuracy. We compare various machine learning-based approaches with Large Language Models (LLMs) approaches, identifying their strengths, similarities and differences. The study investigates key dimensions such as bias, robustness, and explainability, considered important for human-aware operationalization of AES systems. Our study shows that ML-based AES models outperform LLMs in accuracy but struggle with explainability, whereas LLMs provide richer explanations. We also found that both approaches struggle with bias and robustness to edge scores. By analyzing these dimensions, the paper aims to identify challenges and trade-offs between different methods, contributing to more reliable and trustworthy AES methods.

explanation, large language model, machine learning, (19 more...)

arXiv.org Artificial Intelligence

2506.21603

Genre: Research Report (1.00)

Industry:

Education > Educational Setting (1.00)
Education > Assessment & Standards > Student Performance (1.00)
Education > Educational Technology > Educational Software > Computer-Aided Assessment (0.86)
Education > Educational Technology > Educational Software > Computer Based Training (0.72)

Technology:

Information Technology > Artificial Intelligence > Natural Language > Large Language Model (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Performance Analysis > Accuracy (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (1.00)

Add feedback

Long Context Automated Essay Scoring with Language Models

Ormerod, Christopher, Kehat, Gitit

arXiv.org Artificial IntelligenceSep-15-2025

Transformer-based language models are architecturally constrained to process text of a fixed maximum length. Essays written by higher-grade students frequently exceed the maximum allowed length for many popular open-source models. A common approach to addressing this issue when using these models for Automated Essay Scoring is to truncate the input text. This raises serious validity concerns as it undermines the model's ability to fully capture and evaluate organizational elements of the scoring rubric, which requires long contexts to assess. In this study, we evaluate several models that incorporate architectural modifications of the standard transformer architecture to overcome these length limitations using the Kaggle ASAP 2.0 dataset. The models considered in this study include fine-tuned versions of XLNet, Longformer, ModernBERT, Mamba, and Llama models.

arxiv preprint, large language model, machine learning, (17 more...)

arXiv.org Artificial Intelligence

2509.10417

Country: North America > United States (0.28)

Genre: Research Report (0.90)

Industry:

Education > Assessment & Standards > Student Performance (1.00)
Education > Educational Technology > Educational Software > Computer-Aided Assessment (0.72)
Education > Educational Technology > Educational Software > Computer Based Training (0.61)

Technology:

Information Technology > Artificial Intelligence > Natural Language > Large Language Model (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (1.00)

Add feedback

Do We Need a Detailed Rubric for Automated Essay Scoring using Large Language Models?

Yoshida, Lui

arXiv.org Artificial IntelligenceMay-5-2025

This study investigates the necessity and impact of a detailed rubric in automated essay scoring (AES) using large language models (LLMs). While using rubrics are standard in LLM - based AES, creating detailed rubric s requires substantial effort and increases token usage. We examined how different levels of rubric detail affect scoring accuracy across multiple LLMs using the TOEFL11 dataset. Our experiments compared three conditions: a full rubric, a simplified rubric, and no rubric, using four different LLMs (Claude 3.5 Haiku, Gemini 1.5 Flash, GPT - 4o - mini, and Llama 3 70B Instruct). Results showed that three out of four models maintained similar scoring accuracy with the simplified rubric compared to the detailed one, while significantly reducing token usage. However, one model (Gemini 1.5 Flash) showed decreased performance with more detailed rubrics. The findings suggest that simplified rubrics may be sufficient for most LLM - based AES applications, offering a more ef ficient alternative without compromising scoring accuracy. However, model - specific evaluation remains crucial as performance patterns vary across different LLMs.

large language model, machine learning, natural language, (13 more...)

arXiv.org Artificial Intelligence

2505.01035

Country: Asia > Japan > Honshū > Kantō > Tokyo Metropolis Prefecture > Tokyo (0.15)

Genre:

Research Report > New Finding (1.00)
Research Report > Experimental Study (1.00)

Industry:

Education > Assessment & Standards > Student Performance (1.00)
Education > Educational Technology > Educational Software > Computer-Aided Assessment (0.73)
Education > Educational Technology > Educational Software > Computer Based Training (0.73)

Technology:

Information Technology > Artificial Intelligence > Natural Language > Large Language Model (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (1.00)

Add feedback

Rank-Then-Score: Enhancing Large Language Models for Automated Essay Scoring

Cai, Yida, Liang, Kun, Lee, Sanwoo, Wang, Qinghan, Wu, Yunfang

arXiv.org Artificial IntelligenceApr-9-2025

In recent years, large language models (LLMs) achieve remarkable success across a variety of tasks. However, their potential in the domain of Automated Essay Scoring (AES) remains largely underexplored. Moreover, compared to English data, the methods for Chinese AES is not well developed. In this paper, we propose Rank-Then-Score (RTS), a fine-tuning framework based on large language models to enhance their essay scoring capabilities. Specifically, we fine-tune the ranking model (Ranker) with feature-enriched data, and then feed the output of the ranking model, in the form of a candidate score set, with the essay content into the scoring model (Scorer) to produce the final score. Experimental results on two benchmark datasets, HSK and ASAP, demonstrate that RTS consistently outperforms the direct prompting (Vanilla) method in terms of average QWK across all LLMs and datasets, and achieves the best performance on Chinese essay scoring using the HSK dataset.

large language model, machine learning, natural language, (19 more...)

arXiv.org Artificial Intelligence

2504.05736

Genre: Research Report > New Finding (0.46)

Industry:

Education > Assessment & Standards > Student Performance (1.00)
Education > Educational Technology > Educational Software > Computer-Aided Assessment (0.85)
Education > Educational Technology > Educational Software > Computer Based Training (0.62)

Technology:

Information Technology > Artificial Intelligence > Natural Language > Large Language Model (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (0.47)

Add feedback

Enhancing Arabic Automated Essay Scoring with Synthetic Data and Error Injection

Qwaider, Chatrine, Alhafni, Bashar, Chirkunov, Kirill, Habash, Nizar, Briscoe, Ted

arXiv.org Artificial IntelligenceMar-22-2025

Automated Essay Scoring (AES) plays a crucial role in assessing language learners' writing quality, reducing grading workload, and providing real-time feedback. Arabic AES systems are particularly challenged by the lack of annotated essay datasets. This paper presents a novel framework leveraging Large Language Models (LLMs) and Transformers to generate synthetic Arabic essay datasets for AES. We prompt an LLM to generate essays across CEFR proficiency levels and introduce controlled error injection using a fine-tuned Standard Arabic BERT model for error type prediction. Our approach produces realistic human-like essays, contributing a dataset of 3,040 annotated essays. Additionally, we develop a BERT-based auto-marking system for accurate and scalable Arabic essay evaluation. Experimental results demonstrate the effectiveness of our framework in improving Arabic AES performance.

large language model, machine learning, natural language, (19 more...)

arXiv.org Artificial Intelligence

2503.17739

Country:

Asia > Thailand > Bangkok > Bangkok (0.04)
North America > United States > New York (0.04)
Europe > Ukraine > Kyiv Oblast > Kyiv (0.04)
(4 more...)

Genre:

Overview (0.93)
Research Report > New Finding (0.48)
Personal > Interview (0.46)

Industry:

Education > Assessment & Standards > Student Performance (1.00)
Education > Educational Technology > Educational Software > Computer-Aided Assessment (0.71)
Education > Educational Technology > Educational Software > Computer Based Training (0.61)

Technology:

Information Technology > Artificial Intelligence > Natural Language > Large Language Model (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (0.31)

Add feedback

The Impact of Example Selection in Few-Shot Prompting on Automated Essay Scoring Using GPT Models

Yoshida, Lui

arXiv.org Artificial IntelligenceNov-28-2024

This study investigates the impact of example selection on the performance of au-tomated essay scoring (AES) using few-shot prompting with GPT models. We evaluate the effects of the choice and order of examples in few-shot prompting on several versions of GPT-3.5 and GPT-4 models. Our experiments involve 119 prompts with different examples, and we calculate the quadratic weighted kappa (QWK) to measure the agreement between GPT and human rater scores. Regres-sion analysis is used to quantitatively assess biases introduced by example selec-tion. The results show that the impact of example selection on QWK varies across models, with GPT-3.5 being more influenced by examples than GPT-4. We also find evidence of majority label bias, which is a tendency to favor the majority la-bel among the examples, and recency bias, which is a tendency to favor the label of the most recent example, in GPT-generated essay scores and QWK, with these biases being more pronounced in GPT-3.5. Notably, careful example selection enables GPT-3.5 models to outperform some GPT-4 models. However, among the GPT models, the June 2023 version of GPT-4, which is not the latest model, exhibits the highest stability and performance. Our findings provide insights into the importance of example selection in few-shot prompting for AES, especially in GPT-3.5 models, and highlight the need for individual performance evaluations of each model, even for minor versions.

gpt-3, large language model, machine learning, (17 more...)

arXiv.org Artificial Intelligence

doi: 10.1007/978-3-031-64315-6_5

2411.18924

Country:

Asia > Japan > Honshū > Kantō > Tokyo Metropolis Prefecture > Tokyo (0.14)
North America > United States > Texas > Travis County > Austin (0.04)
North America > United States > Louisiana > Orleans Parish > New Orleans (0.04)
(8 more...)

Genre: Research Report > New Finding (1.00)

Industry:

Education > Assessment & Standards > Student Performance (1.00)
Education > Educational Technology > Educational Software > Computer-Aided Assessment (0.52)
Education > Educational Technology > Educational Software > Computer Based Training (0.43)

Technology:

Information Technology > Artificial Intelligence > Natural Language > Large Language Model (1.00)
Information Technology > Artificial Intelligence > Natural Language > Chatbot (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (1.00)

Add feedback

Is GPT-4 Alone Sufficient for Automated Essay Scoring?: A Comparative Judgment Approach Based on Rater Cognition

Kim, Seungju, Jo, Meounggun

arXiv.org Artificial IntelligenceJul-8-2024

Large Language Models (LLMs) have shown promise in Automated Essay Scoring (AES), but their zero-shot and few-shot performance often falls short compared to state-of-the-art models and human raters. However, fine-tuning LLMs for each specific task is impractical due to the variety of essay prompts and rubrics used in real-world educational contexts. This study proposes a novel approach combining LLMs and Comparative Judgment (CJ) for AES, using zero-shot prompting to choose between two essays. We demonstrate that a CJ method surpasses traditional rubric-based scoring in essay scoring using LLMs.

criteria, patience, rubric, (16 more...)

arXiv.org Artificial Intelligence

doi: 10.1145/3657604.3664703

2407.05733

Country:

Asia > South Korea (0.04)
North America > United States > New York (0.04)
North America > United States > New Jersey > Mercer County > Princeton (0.04)
(4 more...)

Genre:

Research Report > New Finding (1.00)
Research Report > Promising Solution (0.86)

Industry:

Education > Assessment & Standards > Student Performance (1.00)
Education > Educational Technology > Educational Software > Computer-Aided Assessment (0.71)
Education > Educational Technology > Educational Software > Computer Based Training (0.61)

Technology:

Information Technology > Artificial Intelligence > Natural Language > Large Language Model (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (1.00)

Add feedback

Automated Text Scoring in the Age of Generative AI for the GPU-poor

Ormerod, Christopher Michael, Kwako, Alexander

arXiv.org Artificial IntelligenceJul-1-2024

Generative language models (GLMs), such as GPT-4 [34] and Claude [2], have demonstrated powerful performance across a variety of language and reasoning tasks. In the field of education, researchers are exploring the extent to which these models can perform tasks such as automated essay scoring [56], providing feedback to students [4], individual tutoring [7], and more [15]. Although GLMs show promise in automating certain educative tasks, there are critical limitations that hinder the possibility of wider implementation. For instance, researchers have shown that GLMs can be "jail-broken" to bypass safety guardrails [58] and can disclose personally identifiable information. Large GLMs are extremely large, requiring millions of dollars to train and deploy; as such, they are highly inefficient for specialized tasks [26]. These models are constantly being updated, sometimes leading to degraded performance [6], and they are only accessible via Application Programming Interfaces (APIs), which lead to issues around replicability and leave little room to conduct rigorous research. It is for these reasons that we shift the focus away from large, proprietary GLMs toward smaller, open-source GLMs. In this study, we focus on two educational applications: Automated Text Scoring (ATS) and providing feedback--specifically, feedback that justifies scores based on the scoring rubric. Our study is the first to demonstrate that it is possible to efficiently fine-tune such GLMs to yield high-quality scores, and that (at least some) feedback from fine-tuned models can explain these scores.

glm, large language model, machine learning, (20 more...)

arXiv.org Artificial Intelligence

2407.01873

Country:

North America > United States > Texas > Travis County > Austin (0.04)
North America > United States > New Jersey > Bergen County > Mahwah (0.04)
North America > Canada > British Columbia > Metro Vancouver Regional District > Vancouver (0.04)
(3 more...)

Genre: Research Report > New Finding (1.00)

Industry:

Education > Assessment & Standards > Student Performance (0.57)
Education > Educational Technology > Educational Software > Computer-Aided Assessment (0.35)

Technology:

Information Technology > Artificial Intelligence > Natural Language > Large Language Model (1.00)
Information Technology > Artificial Intelligence > Natural Language > Chatbot (0.96)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning > Generative AI (0.50)

Add feedback

Automated Essay Scoring Using Grammatical Variety and Errors with Multi-Task Learning and Item Response Theory

Doi, Kosuke, Sudoh, Katsuhito, Nakamura, Satoshi

arXiv.org Artificial IntelligenceJun-13-2024

This study examines the effect of grammatical features in automatic essay scoring (AES). We use two kinds of grammatical features as input to an AES model: (1) grammatical items that writers used correctly in essays, and (2) the number of grammatical errors. Experimental results show that grammatical features improve the performance of AES models that predict the holistic scores of essays. Multi-task learning with the holistic and grammar scores, alongside using grammatical features, resulted in a larger improvement in model performance. We also show that a model using grammar abilities estimated using Item Response Theory (IRT) as the labels for the auxiliary task achieved comparable performance to when we used grammar scores assigned by human raters. In addition, we weight the grammatical features using IRT to consider the difficulty of grammatical items and writers' grammar abilities. We found that weighting grammatical features with the difficulty led to further improvement in performance.

computational linguistic, grammatical feature, grammatical item, (15 more...)

arXiv.org Artificial Intelligence

2406.08817

Country:

North America > United States > Washington > King County > Seattle (0.28)
North America > United States > Minnesota > Hennepin County > Minneapolis (0.14)
Europe > United Kingdom > England > Cambridgeshire > Cambridge (0.04)
(17 more...)

Genre: Research Report > New Finding (0.88)

Industry:

Education > Assessment & Standards > Student Performance (1.00)
Education > Educational Technology > Educational Software > Computer-Aided Assessment (0.83)

Technology: Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (0.94)

Add feedback

Empirical Study of Large Language Models as Automated Essay Scoring Tools in English Composition__Taking TOEFL Independent Writing Task for Example

Xia, Wei, Mao, Shaoguang, Zheng, Chanjing

arXiv.org Artificial IntelligenceJan-7-2024

Large language models have demonstrated exceptional capabilities in tasks involving natural language generation, reasoning, and comprehension. This study aims to construct prompts and comments grounded in the diverse scoring criteria delineated within the official TOEFL guide. The primary objective is to assess the capabilities and constraints of ChatGPT, a prominent representative of large language models, within the context of automated essay scoring. The prevailing methodologies for automated essay scoring involve the utilization of deep neural networks, statistical machine learning techniques, and fine-tuning pre-trained models. However, these techniques face challenges when applied to different contexts or subjects, primarily due to their substantial data requirements and limited adaptability to small sample sizes. In contrast, this study employs ChatGPT to conduct an automated evaluation of English essays, even with a small sample size, employing an experimental approach. The empirical findings indicate that ChatGPT can provide operational functionality for automated essay scoring, although the results exhibit a regression effect. It is imperative to underscore that the effective design and implementation of ChatGPT prompts necessitate a profound domain expertise and technical proficiency, as these prompts are subject to specific threshold criteria. Keywords: ChatGPT, Automated Essay Scoring, Prompt Learning, TOEFL Independent Writing Task

large language model, machine learning, natural language, (14 more...)

arXiv.org Artificial Intelligence

2401.03401

Country:

Asia > China > Shanghai > Shanghai (0.05)
Asia > China > Beijing > Beijing (0.04)

Genre: Research Report > Experimental Study (0.86)

Industry:

Education > Educational Technology > Educational Software > Computer-Aided Assessment (1.00)
Education > Educational Technology > Educational Software > Computer Based Training (1.00)
Education > Assessment & Standards > Student Performance (1.00)

Technology:

Information Technology > Artificial Intelligence > Natural Language > Large Language Model (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (1.00)

Add feedback